TOC

0. Objectives
1. Introduction
2. Alternative Splicing Data Analysis
3. Single-cell RNA-Seq Data Analysis
   1) Clustering of Cells Not Using Marker Genes (Only SC RNA-Seq Data)
   2) Single-cell DEG Analysis
   3) Reconstruction of Single-Cell Trajectory
4. RNA Editing

0. Objectives

RNA-Seq makes it possible to perform detailed data analysis at the nucleotide and single-variant level. It is now widely applied not only to transcript expression profiling, but also to miRNA-Seq and miRNA DEG analysis, alternative splicing event calling, lncRNA (long non-coding RNA) analysis, fusion-gene detection (SV), RNA editing, and single-cell transcriptome analysis. In this chapter we will learn how to perform alternative splicing analysis with rMATS, how to do scRNA-Seq transcriptome analysis based on single-cell genomic amplification technology, and how to analyze RNA editing data with RCARE.

Single-Cell Trajectory?

1. Introduction

With the ever-accelerating pace of development in NGS technology, more and more transcriptome studies at the base (nucleotide) level are accumulating. The most prominent areas of transcriptome research are RNA-Seq and miRNA-Seq, but there are other important applications such as splicing analysis, lncRNA (long non-coding RNA) analysis, fusion gene detection, RNA editing detection, and single-cell RNA-seq analysis.

Apparently lncRNAs play an important role in the development, maintenance, and overall physiology of cancer, according to this article. They are involved in chromatin remodeling, as well as transcriptional and post-transcriptional regulation, through a variety of chromatin-based mechanisms and via cross-talk with other RNA species. They can act as decoys, scaffolds, and enhancer RNAs. As much as 85% of the human genome is transcribed, but only a small portion of the transcripts are protein-coding. Non-coding RNAs (ncRNAs, henceforth) can be roughly classified into two groups: short RNAs less than 200 nucleotides (nt) in length, such as miRNAs (microRNAs, 21-24 nt) and piwi-interacting RNAs (piRNAs), and long ncRNAs (lncRNAs) of about 200 nt or more. While miRNAs have been studied heavily, lncRNAs are less understood. Dysregulation of lncRNAs has been noted in glioblastoma, breast cancer, and leukemia. Such dysregulation commonly affects cellular functions such as cell proliferation, resistance to apoptosis, induction of angiogenesis, promotion of metastasis, and evasion of tumor suppressors. An emerging view is that lncRNAs are fundamental transcriptional regulators. How their molecular intricacies work has been quite an open question, and several mechanistic models have been proposed: functioning as signals, decoys, scaffolds, guides, enhancer RNAs, and short peptides. An lncRNA can signal certain transcriptional activities (a hint?), and it can also work as a decoy, providing non-functional binding sites.

For DEG analysis, we use DEGSeq (Wang et al., 2010), DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), and baySeq (Hardcastle and Kelly, 2010). For alternative splicing analysis we use MISO (Katz et al., 2010), rMATS (Shen et al., 2014), SUPPA (Alamancos et al., 2014), etc. miRNA-seq technology analyzes miRNA transcripts, which are part of the non-coding RNAs. Some non-coding RNAs serve gene expression regulatory functions, protect and preserve self-DNA against invasive DNA originating from external sources, and assist in DNA synthesis and stability. miRNAs are also being investigated and applied in cancer diagnosis and prognosis prediction (Hayes et al., 2014; Thomas et al., 2015; Khan et al., 2015). You can also find cancer-specifically expressed miRNAs by DEG analysis with edgeR and DESeq2 (Love et al., 2014).

Single-cell sequencing is an optimized NGS technology that provides insights into the function of each individual cell in the rich context of the microenvironment that surrounds it. scRNA-seq (single-cell RNA sequencing) gives us the expression profile of individual cells. It is unrealistic to expect information about every RNA in a cell, but the overall expression pattern is certainly there, and it can be analyzed through gene cluster analyses.

This practice covers the basics and real-data case studies of alternative splicing data analysis, miRNA DEG analysis, single-cell transcriptome analysis, and RNA editing data analysis.

Question: How do we discern miRNAs or lncRNAs specifically within a pool of transcripts? Answer: Easy! Length.

2. Alternative Splicing Data Analysis

Alternative splicing is a phenomenon found in eukaryotic cells. It is a way of getting more genetic diversity out of the limited information space of genomic DNA. There are about 22,000 human genes, and alternative splicing provides leeway for producing a more diverse set of proteins. However, base-level analysis of transcripts was difficult before the advent of RNA-seq. There were attempts to use microarrays with exon-junction probes and tiling-array technology for transcript analysis, but they proved insufficient. To be honest, RNA-seq based alternative splicing analysis was also unstable at the beginning, but nowadays various analytical tools and methods have been introduced to the field.

The Percent Spliced In (PSI) value quantifies how often the exon or intron of interest is detected in the transcriptome data (reads that include those sequences are called "spliced-in" reads). Basically, it is the number of reads supporting the inclusion isoform divided by the sum of reads supporting that isoform and reads that exclude it.

\[\textrm{PSI (percent spliced-in)} = \frac{\textrm{reads supporting inclusion}}{\textrm{reads supporting inclusion} + \textrm{reads supporting exclusion}} \times 100\,\%\]
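As a quick sanity check, the formula can be computed directly from read counts (a minimal sketch; this count-based PSI ignores the isoform-length normalization that rMATS applies, which comes up later):

```python
def psi(inclusion_reads, exclusion_reads):
    """Percent spliced-in: share of event-covering reads that support inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this splicing event")
    return 100.0 * inclusion_reads / total

# e.g. 30 reads contain the exon of interest, 10 reads skip it
print(psi(30, 10))  # 75.0
```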

We use rMATS, an alternative splicing analysis tool, to produce our own PSI values from RNA-seq data. (Then we would need 1. the exons and introns of interest, 2. transcriptome data from RNA-seq, and 3. quite possibly annotation data?)

What I imagine the whole process is going to be is like this: RNA-seq raw reads \(\rightarrow\) aligned to reference (bam) and assembled to transcripts \(\rightarrow\) annotated (gtf) \(\rightarrow\) rMATS (splicing analysis!) \(\rightarrow\) PSI value.

It actually goes like this:

Alternative Splicing Event Analysis with rMATS, workflow

RNA-seq fastq file \(\rightarrow\) Read alignment \(\rightarrow\) RNA-seq bam file \(\rightarrow\) Detection of alternative splicing events with rMATS \(\rightarrow\) Visualization of alternative splicing regions

Now look at the BAM file that has already been aligned; check out its contents and characteristics with the following command.

(base) soh1@soh1:~/GDA/rMATS/testData$ samtools view 231ESRP.25K.rep-1.bam | less

The data looks like this:

HWI-ST568_0055:2:1101:14836:3706#TGACCA 353     chr1    17941   1       50M     =       18296   405     GGTGCTTGCTCTGGATCCTGTGGCGGGGGGGGCTCTGCAGGCCAGGGGCC      ggggggggggggggggggggggggggegdBBBBBBBBBBBBBBBBBBBBB      NM:i:3  NH:i:4  CC:Z:chr15      CP:i:102513175  HI:i:0

Let's review the SAM format. It has the read name (QNAME: query name); the bitwise flag (FLAG: information about pairing, strand, mate strand, etc. For example, 353 above decomposes into 1+32+64+256, i.e. 0x1, 0x20, 0x40, 0x100, which is 101100001 in binary); chromosome (RNAME: reference sequence name); position (POS: 1-based leftmost position of the clipped alignment); Phred-scale mapping quality (MAPQ); CIGAR string (CIGAR: extended CIGAR string, operations MIDNSHP); mate reference name (MRNM, "=" if same as RNAME); 1-based leftmost mate position (MPOS); inferred insert size (ISIZE); sequence on the same strand as the reference (SEQ); and query quality score (QUAL: ASCII-33 = Phred base quality). Refer to this article on Biostars for further information.
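Decoding the FLAG by hand gets old quickly, so here is a small sketch that names the set bits (bit meanings per the SAM specification; the short names are my own labels):

```python
# SAM bitwise FLAG bits, per the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if flag & bit]

# FLAG 353 from the record above: 1 + 32 + 64 + 256
print(decode_flag(353))  # ['paired', 'mate_reverse', 'first_in_pair', 'secondary']
```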

Now, use the BAM files as input to rMATS to find out how much alternative splicing occurred, by event type. rMATS computes a PSI value for each gene in each of the two comparison groups and expresses the significance of the difference between the two PSI values with a p-value. This process takes about 10 minutes.

The rMATS options are used as follows:

  1. -b1: BAM file input 1.
  2. -b2: BAM file input 2.
  3. -gtf: GTF file input. Contains the gene and transcript annotations.
  4. -o: Output directory.
  5. -t: Short read type: 'single' for single-end, 'paired' for paired-end.
  6. -len: Length of each short read. You can find this in the CIGAR string, the 6th column of the BAM file.
  7. -c: A value between 0 and 1; the cutoff splicing difference used in the null hypothesis test of the differential splicing analysis.
  8. -analysis: Analysis type. 'P' stands for paired analysis, 'U' for unpaired analysis.
  9. -libType: Library type. If the library is strand-specific, "fr-firststrand" or "fr-secondstrand"; if not strand-specific, "fr-unstranded".
python2 ../rMATS.3.2.2.beta/RNASeq-MATS.py -b1 ./231ESRP.25K.rep-1.bam,./231ESRP.25K.rep-2.bam -b2 ./231EV.25K.rep-1.bam,./231EV.25K.rep-2.bam -gtf ../rMATS.3.2.2.beta/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -o bam_test -t paired -len 50 -c 0.0001 -analysis U -libType fr-firststrand
cd bam_test
less 'log.RNASeq-MATS.2020-08-13 13:29:08.771400.txt'

If I look into 'log.RNASeq-MATS.2020-08-13 13:29:08.771400.txt', it looks like this.

2020-08-13 13:29:08,781 ################### folder names and associated input files #############
2020-08-13 13:29:08,782 SAMPLE_1\REP_1  ./231ESRP.25K.rep-1.bam
2020-08-13 13:29:08,782 SAMPLE_1\REP_2  ./231ESRP.25K.rep-2.bam
2020-08-13 13:29:08,782 SAMPLE_2\REP_1  ./231EV.25K.rep-1.bam
2020-08-13 13:29:08,782 SAMPLE_2\REP_2  ./231EV.25K.rep-2.bam
2020-08-13 13:29:08,782 #########################################################################

2020-08-13 13:29:08,782 start mapping..
2020-08-13 13:29:08,782 bam files are provided. skip mapping..
2020-08-13 13:29:08,782 done mapping..
2020-08-13 13:29:08,782 getting uniquely mapped reads or pairs
2020-08-13 13:29:08,782 getting unique SAM function..
2020-08-13 13:29:08,784 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,784 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,784 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,784 Retry up to 3 more times..
2020-08-13 13:29:08,784 Retry: 1
2020-08-13 13:29:08,786 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,786 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,786 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,786 Retry: 2
2020-08-13 13:29:08,788 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,788 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,788 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,788 Retry: 3
2020-08-13 13:29:08,790 getting uniquely mapped reads or pairs for sample_1, rep_1 is done with status 512
2020-08-13 13:29:08,790 error in getting uniquely mapped reads from smaple_1, rep_1: 512
2020-08-13 13:29:08,790 error detail: awk: line 2: function and never defined
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
2020-08-13 13:29:08,790 There is an exception in getting unique SAM files
2020-08-13 13:29:08,790 Exception: <type 'exceptions.Exception'>
2020-08-13 13:29:08,790 Detail: 

The error says it all: awk reports that a function "and" is never defined in line 2 of the script it was given.

Apparently there have been issues reported about this: link1, link2

When I check the awk version with awk -W version, it looks like this:

(base) soh1@ssoh1:~/GDA/rMATS/rMATS.3.2.2.beta$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan

random-funcs:       srandom/random
regex-funcs:        internal
compiled limits:
sprintf buffer      8192
maximum-integer     2147483647

According to this StackOverflow thread and the mawk manual, mawk is a minimal-featured awk written by Mike Brennan and designed for speed of execution over functionality. It may therefore behave differently from gawk or a POSIX awk. Although its man page claims POSIX compliance, in practice it is not fully POSIX-compliant.

So I think I need to install gawk with the sudo apt-get install gawk command for the code to work properly.

To check which awk I have, I type awk --version, which should work if gawk is installed. The output is:

(base) soh1@soh1:~/GDA/rMATS/rMATS.3.2.2.beta$ awk --version
awk: not an option: --version

So what's awk? awk stands for Aho + Weinberger + Kernighan (Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan). awk's purpose is to select records from a file, pull out records matching a particular pattern, and manipulate or tabulate the selected record values. For example, with awk you can:

  • print the entire content of a text file,
  • print only specific fields of a file,
  • add or insert a string of interest into specific fields that satisfy a filtering condition, and print those fields,
  • print records that contain a pattern,
  • print the result of a calculation on specific field(s),
  • print conditionally according to how field values compare.

For in-depth learning experience of awk, visit this blog called Recipes for Developer.

So now I install gawk.

(base) soh1@soh1:~/GDA/rMATS/rMATS.3.2.2.beta$ sudo apt-get install gawk
[sudo] password for soh1: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  gawk-doc
The following NEW packages will be installed:
  gawk
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 418 kB of archives.
After this operation, 1,708 kB of additional disk space will be used.
Get:1 http://kr.archive.ubuntu.com/ubuntu focal/main amd64 gawk amd64 1:5.0.1+dfsg-1 [418 kB]
Fetched 418 kB in 2s (169 kB/s)
Selecting previously unselected package gawk.
(Reading database ... 368380 files and directories currently installed.)
Preparing to unpack .../gawk_1%3a5.0.1+dfsg-1_amd64.deb ...
Unpacking gawk (1:5.0.1+dfsg-1) ...
Setting up gawk (1:5.0.1+dfsg-1) ...
Processing triggers for man-db (2.9.1-1) ...

Now if I test-run echo | awk '{print systime()}', it works fine. Problem solved…? Not yet.

Then I run RNASeq-MATS.py again.

Forget it. I'm running it on the server instead.

Instead, I used a conda environment specifically for rMATS.

There were warnings that RNASeq-MATS.py had an indexing error at line 214. I looked into it and found that a variable called output wasn't being split correctly. Tracing output showed that it referred to these lines,

myCmd = 'samtools view '+fq+' | head -n 1';
status,output=commands.getstatusoutput(myCmd);

which meant it had something to do with samtools not being installed.

When I installed samtools, it still showed warnings, and samtools was giving me the errors described in this issue log.

This was an openssl version compatibility problem, as mentioned below, so I removed samtools and reinstalled it along with a downgraded openssl.

To summarize, this server has…:

  • python version 2.7.15 (conda install python=2.7.15)
  • numpy (conda install numpy)
  • scipy (conda install scipy)
  • samtools 1.9 with downgraded openssl (openssl 1.0.0 instead of openssl 1.1.1) \(\rightarrow\) There was an issue with libcrypto.so.1.1.1 being referenced, which is apparently incompatible with samtools 1.9, so I installed the libcrypto.so.1.0.0 included in openssl 1.0.0. (conda install -c bioconda samtools openssl=1.0)

This solved the issue for me.

python ../rMATS.3.2.2.beta/RNASeq-MATS.py -b1 ./231ESRP.25K.rep-1.bam,./231ESRP.25K.rep-2.bam -b2 ./231EV.25K.rep-1.bam,./231EV.25K.rep-2.bam -gtf ../rMATS.3.2.2.beta/gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf -o bam_test -t paired -len 50 -c 0.0001 -analysis U -libType fr-firststrand

It gives me a report (less summary.txt in bam_test) like this:

################### folder names and associated input files #############
SAMPLE_1\REP_1  testData/231ESRP.25K.rep-1.bam
SAMPLE_1\REP_2  testData/231ESRP.25K.rep-2.bam
SAMPLE_2\REP_1  testData/231EV.25K.rep-1.bam
SAMPLE_2\REP_2  testData/231EV.25K.rep-2.bam
#########################################################################

############################# MATS Report #############################
======================================================================
EventType       NumEvents.JC.only       SigEvents.JC.only       NumEvents.JC+readsOnTarget      SigEvents.JC+readsOnTarget
======================================================================
SE      35      1 (1:0) 49      1 (1:0)
MXE     14      0 (0:0) 22      0 (0:0)
A5SS    7       0 (0:0) 8       0 (0:0)
A3SS    8       0 (0:0) 10      0 (0:0)
RI      13      0 (0:0) 25      0 (0:0)
======================================================================
################# Report Legend ################
EventType: Type of AS event
        SE: Skipped exon
        MXE: Mutually exclusive exon
        A5SS: Alternative 5' splice site
        A3SS: Alternative 3' splice site
        RI: Retained intron
NumEvents.JC.only: total number of events detected using Junction Counts only
SigEvents.JC.only: number of significant events detected using Junction Counts only
        The numbers in the parentheses (n1:n2) indicate the number of significant events that have higher inclusion level for SAMPLE_1 (n1) or for SAMPLE_2 (n2)
NumEvents.JC+readsOnTarget: total number of events detected using both Junction Counts and reads on target
SigEvents.JC+readsOnTarget: number of significant events detected using both Junction Counts and reads on target
        The numbers in the parentheses (n1:n2) indicate the number of significant events that have higher inclusion level for SAMPLE_1 (n1) or for SAMPLE_2 (n2)
################################################

If you want more detailed info, read SE.MATS.ReadsOnTargetAndJunctionCounts.txt in bam_test.

It contains information like this:

ID      GeneID  geneSymbol      chr     strand  exonStart_0base exonEnd upstreamES      upstreamEE      downstreamES    downstreamEE    ID      IC_SAMPLE_1     SC_SAMPLE_1     IC_SAMPLE_2     SC_SAMPLE_2     IncFormLen      SkipFormLen     PValue  FDR     IncLevel1       IncLevel2       IncLevelDifference
6461    "ENSG00000122566"       "HNRNPA2B1"     chr7    -       26237450        26237486        26237241        26237352        26240191        26240366        6461    14,13   1,0     0,0     13,13   85      49      5.977884853791693e-12   2.92916357836e-10       0.89,1.0        0.0,0.0 0.945

The columns are:

  • ID: user-defined alternative splicing event ID
  • GeneID / geneSymbol: gene ID and gene symbol of the spliced transcript
  • chr / strand: chromosome and strand of the splicing event
  • exonStart_0base / exonEnd: skipped exon start and end positions
  • upstreamES / upstreamEE: start and end of the exon upstream of the skipped exon
  • downstreamES / downstreamEE: start and end of the exon downstream of the skipped exon
  • ID: the user-defined ID yet again
  • IC_SAMPLE_1 / SC_SAMPLE_1: inclusion and skipping counts for SAMPLE_1 (replicates separated by commas)
  • IC_SAMPLE_2 / SC_SAMPLE_2: the same for SAMPLE_2
  • IncFormLen / SkipFormLen: lengths of the inclusion and skipping forms (used for normalization)
  • PValue / FDR: p-value and FDR for differential splicing
  • IncLevel1 / IncLevel2: inclusion levels for the SAMPLE_1 and SAMPLE_2 replicates, from normalized counts
  • IncLevelDifference: average(IncLevel1) - average(IncLevel2)
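The inclusion levels in the example row can be reproduced from the raw counts: rMATS divides each count by its form length before taking the ratio. A sketch of that calculation (the helper name is mine, not rMATS code):

```python
def inclusion_level(ic, sc, inc_len, skip_len):
    """Length-normalized inclusion level for one replicate, as reported by rMATS."""
    inc = ic / inc_len
    skip = sc / skip_len
    if inc + skip == 0:
        return None  # event not covered in this replicate
    return round(inc / (inc + skip), 2)

# Example row above: IC_SAMPLE_1 = 14,13  SC_SAMPLE_1 = 1,0
#                    IC_SAMPLE_2 = 0,0    SC_SAMPLE_2 = 13,13
#                    IncFormLen = 85      SkipFormLen = 49
lvl1 = [inclusion_level(ic, sc, 85, 49) for ic, sc in [(14, 1), (13, 0)]]
lvl2 = [inclusion_level(ic, sc, 85, 49) for ic, sc in [(0, 13), (0, 13)]]
print(lvl1, lvl2)      # [0.89, 1.0] [0.0, 0.0]
diff = sum(lvl1) / len(lvl1) - sum(lvl2) / len(lvl2)
print(round(diff, 3))  # 0.945  (matches IncLevelDifference)
```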

Now, let's plot the exon that showed the most statistically significant alternative splicing in SE.MATS.ReadsOnTargetAndJunctionCounts.txt. This process takes about 5 minutes.

This requires rmats2sashimiplot package, so I installed it via conda install -c bioconda rmats2sashimiplot.

Apparently you can also visualize sashimi plots in IGV, so I read the following definition of a sashimi plot: link. Or maybe the original publications on the sashimi plot are here, here, and here. Actually, from the Broad Institute homepage I found this link that officially documents sashimi_plot and its parent library MISO.

What is sashimi_plot?

Sashimi plots visualize splice junctions for multiple samples from their alignment data, alongside genomic coordinates and a user-specified annotation track. sashimi_plot is a utility for automatically producing publication-quality Sashimi plots for RNA-Seq analyses of isoform expression, and it is part of the MISO framework. In particular, it can 1) plot raw RNA-Seq read densities along exons and junctions for multiple samples, while simultaneously visualizing the gene model/isoforms to which reads map, and 2) plot MISO output alongside the raw data or separately.

  1. (A) Anatomy of a Sashimi plot. Gene model annotation containing two isoforms differing by inclusion/exclusion of a middle exon. The Sashimi plot for the two grey exons (blue boxed region) is shown, where genomic reads are converted into read densities (per-base expression on the y-axis) and junction reads are plotted as arcs whose width is proportional to the number of reads aligned to the junction spanning the exons connected by the arc. Genomic coordinates are shown on the x-axis. (B) Inputs required for making a Sashimi plot. Gene model annotations (in GFF format), RNA-Seq read alignments (BAM format) and optionally isoform expression estimates (by MISO) are used to make Sashimi plots. Sashimi plots can be made with a stand-alone program that makes customizable publication-quality figures, or dynamically from the IGV browser.

The genomic reads are converted into read densities (per-base expression on the y-axis: RPKM), shown as track height in the plot, and junction reads are plotted as arcs whose width is proportional to the number of reads aligned to the junction spanning the exons connected by the arc (the reads that span across the exons).

  1. (A) Sashimi plot (stand-alone) for an alternatively spliced exon and flanking exons in four samples (colored by experimental condition). Per-base expression is plotted on the y-axis of the Sashimi plot, genomic coordinates on the x-axis, and the quantified mRNA isoforms are shown at the bottom (exons in black, introns as lines with arrowheads). (B) Optional display of isoform expression information produced by MISO. Posterior distributions over Ψ values are shown as histograms (Ψ values on the x-axis, frequency on the y-axis). (C) Genomic region of interest in IGV along with two alignment tracks (top). (D) Sashimi plot for the region shown in (C). Genomic coordinates are plotted on the x-axis and read density (whose value is configurable via IGV) on the y-axis.

So let’s get to it and plot our own sashimi plot.

cd testData
rmats2sashimiplot -b1 231ESRP.25K.rep-1.bam,231ESRP.25K.rep-2.bam -b2 231EV.25K.rep-1.bam,231EV.25K.rep-2.bam -t SE -e ~/GDA/rMATS/bam_test/MATS_output/SE.MATS.ReadsOnTargetAndJunctionCounts.txt -l1 ESRP -l2 EV -o test_events_output
  • -b1: first group's BAM file(s)
  • -b2: second group's BAM file(s)
  • -t: splicing event type (here SE, skipped exon)
  • -e: the rMATS output file providing the events
  • -l1: label for the first sample
  • -l2: label for the second sample
  • -o: output directory where the results will be stored

The options above were wrong: the sample-specific flags take double dashes.

They should be changed from -b1 \(\rightarrow\) --b1, -b2 \(\rightarrow\) --b2, -l1 \(\rightarrow\) --l1, -l2 \(\rightarrow\) --l2.

Therefore I executed the following command again.

rmats2sashimiplot --b1 231ESRP.25K.rep-1.bam,231ESRP.25K.rep-2.bam --b2 231EV.25K.rep-1.bam,231EV.25K.rep-2.bam -t SE -e ~/GDA/rMATS/bam_test/MATS_output/SE.MATS.ReadsOnTargetAndJunctionCounts.txt --l1 ESRP --l2 EV -o test_events_output

Go into the output directory test_events_output/Sashimi_plot and see.

In the resulting figure, the heights of the red and orange tracks represent the read sequencing depths of the ESRP and EV transcripts. Comparing the two isoforms at the bottom, ESRP appears to show an exon-inclusion isoform (the exon is included: IE), and EV an exon-skipping isoform (that particular exon is skipped entirely: SE). Black boxes denote exons and lines with arrowheads denote introns. Note also that even when reads over a splice junction happen not to be detected and no arc is drawn, it doesn't mean that combination of exons doesn't exist.

3. Single-Cell RNA-Seq Data Analysis

Refer to this excellent report on the current state of single-cell RNA sequencing technology.

Single-cell RNA-Seq data analysis can be applied to tackle these problems:

  1. Distinguish heterogeneity within a sample (heterogeneous sample)

  2. Rare cell types (those that are hard to detect)

  3. Cell evolution & Phylogenetics

  4. Somatic mosaicism

  5. Analysis of unculturable micro-organisms

  6. Disease analysis

The whole experimental procedure goes like this:

  1. Single-cell separation

  2. Nucleic Acid Prep

  3. Amplification

  4. Library Prep

  5. Sequencing and Bioinformatic Analysis

The Generally Accepted Process of Single cell RNA analysis

1) Pre-processing

Physically separate the different types of cells (perhaps by FACS?) and attach different barcode sequences with adapters. Usually FPM normalization.

2) Dimensionality reduction

Single-cell sequencing data contains expression information for at least 20,000-30,000 genes and is therefore high-dimensional. Taking all these variables into account would make computation slow and painful, vulnerable to noise, and full of useless information. Therefore we remove statistically insignificant genes from the list (about 2,000-3,000 genes remain). Additional dimension reduction techniques like PCA (principal component analysis) or CCA (canonical correlation analysis) are then applied to group genes that show similar patterns of variability. The primary purpose of this step is to exclude as much noise as possible while retaining the biologically relevant and crucial variables.
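For intuition, here is a bare-bones PCA via SVD on a mean-centered cell-by-gene matrix (a plain NumPy sketch with invented toy data; real pipelines would use Seurat's RunPCA or Scanpy's pca on properly normalized counts):

```python
import numpy as np

def pca(X, n_components=2):
    """Project cells (rows) onto the top principal components of gene space."""
    Xc = X - X.mean(axis=0)            # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # low-dimensional cell coordinates

rng = np.random.default_rng(0)
# Toy data: 100 "cells" x 50 "genes", two groups with shifted expression
X = np.vstack([rng.normal(0, 1, (50, 50)), rng.normal(3, 1, (50, 50))])
coords = pca(X, n_components=2)
print(coords.shape)  # (100, 2)
```

The two toy groups separate cleanly along the first component, which is exactly the behavior this step relies on for clustering real cells.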

3) Conversion of Data into Graphs

After conversion, the data are coordinates in the low-dimensional space, and unique coordinates are assigned to each and every cell. Similar cells have similar coordinates and thus cluster among themselves, forming a geometric structure. We need various geometric approaches for understanding cell clustering, and the most representative one is a graph. A collection of points and lines forms a graph, which is very useful for simplifying complex shapes for analysis. The k-nearest-neighbor graph, which connects each point (node) to its k closest neighbors, is the form most used in single-cell RNA sequencing.
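A brute-force sketch of building such a k-nearest-neighbor graph from low-dimensional coordinates (the toy points are mine; real tools use approximate nearest-neighbor search for speed):

```python
import numpy as np

def knn_graph(coords, k=3):
    """Return an adjacency list: each cell connected to its k nearest cells."""
    # Pairwise Euclidean distances between all cells
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # no self-edges
    return {i: [int(j) for j in np.argsort(row)[:k]] for i, row in enumerate(d)}

# Three tightly packed cells and one distant outlier
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
g = knn_graph(pts, k=2)
print(g[0])  # [1, 2] -- the two nearest neighbors of cell 0
```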

4) Graph clustering and Pseudotime

The graph-converted data feeds into various analyses. Cells that belong to a group form a well-connected cluster, and algorithms that cut out such clusters make classification easy; Louvain clustering is used here. Discovering marker genes, classifying cells into different types, and analyzing the features (characteristics) of each type is the primary process in single-cell sequencing.

Clustering cuts out a particular portion of cells and classifies them; at the opposite end is the approach where one studies the connections within the clusters. Cells undergoing development or differentiation change their expression profiles continuously, so the connection looks like a smooth cylindrical shape when expressed as a graph. Simplifying this connectivity structure, finding the central axis, and arranging the cells along that axis lets us see the cells' change over time. You could call it "time projected onto space", and it is therefore called "pseudotime". Many pseudotime analysis methods have already been developed, and different ones are optimal for different structures, so it's better to try several. One thing to beware of is that cells are ordered simply by the similarity of their expression patterns, which can't necessarily be equated to actual development or differentiation steps. For example, when two different cell types converge on a strikingly similar cell type, it's hard to discern which is the starting point and which the end point. To address this problem, an analytical method called velocyto was developed that predicts where a cell is headed, based on a model of the ratio between intron and exon reads. The model rests on the premise that newly transcribed genes have a higher ratio of intronic sequence. This fits common sense: freshly synthesized transcripts haven't had time to get spliced, so more introns are present in the transcriptome. (Introns are generally excised and degraded by the cellular machinery; refer to the article "Lariat intronic RNAs in the cytoplasm of vertebrate cells", which says that lariat RNAs are destroyed within minutes of formation in the cell nucleus. However, many intronic RNAs are exported to the cytoplasm, where they remain as stable circular molecules and are thought to play unexpected roles in cell metabolism.)
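To get a feel for pseudotime, here is a toy sketch that orders cells along their dominant axis of variation (projection onto the first principal direction; actual tools such as Monocle fit a curved trajectory rather than a straight line, and the sign of the ordering is arbitrary):

```python
import numpy as np

def pseudotime(coords):
    """Order cells by projection onto the dominant axis of variation."""
    Xc = coords - coords.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ Vt[0]       # position along the first principal axis
    t = t - t.min()
    return t / t.max()   # rescale to [0, 1]

# Toy "trajectory": 30 cells drifting along a line, plus noise
rng = np.random.default_rng(1)
true_order = np.linspace(0, 1, 30)
coords = np.column_stack([true_order * 10, true_order * 5]) + rng.normal(0, 0.1, (30, 2))
t = pseudotime(coords)
```

On this toy data, t recovers the drift order (up to reversal), which is all pseudotime promises: a relative ordering, not actual elapsed time.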

Analytic Packages

Usually implemented in R or Python. Seurat in R and Scanpy in Python are widely used. These packages are convenient because they provide the whole chain of tools for converting, storing, and analyzing the data. There are also packages specifically for pseudotime analysis or for canceling out experimental variation, so it's advisable to study them and apply them where appropriate.

Prospects?

  • Sequencing RNA and other markers (other macromolecules)

  • Single-cell Genomics

  • Merging with Imaging: Information about a cell's location within the tissue is lost; that's one main disadvantage of single-cell sequencing. Imaging techniques are being developed to complement this shortcoming. Usually the number of genes that can be viewed simultaneously decreases as resolution increases. If we find an efficient marker combination, it may become possible to merge transcriptome data with imaging data of the entire tissue and see expression information at single-cell resolution. If we think of biological processes as comprising three axes (time, space, and gene expression), then retrieving the space and gene-expression data without loss may become a reality.

Install the necessary packages

We need:

  • devtools
  • ggplot2
  • cowplot
  • dplyr: data processing package, a C++-backed successor to plyr with fast computation. Functions for data.frame and data.table, and support for databases like MySQL, PostgreSQL, SQLite, BigQuery. Functions like filter() (conditionally extract rows), arrange(), and more are super useful.
  • Rmisc: "Ryan Miscellaneous", contains functions useful for data analysis and utility operations. https://cran.r-project.org/web/packages/Rmisc/index.html
  • monocle: analysis toolkit for single-cell RNA-seq; can perform pseudotime analysis of single-cell trajectories, clustering by gene expression pattern, differential expression (DEG) analysis, etc. http://cole-trapnell-lab.github.io/monocle-release/
  • rhdf5: interface between R and HDF5 (Hierarchical Data Format version 5), a file format for storing large data.
  • cellrangerRkit-2.0.0: demultiplexing, barcode counting, etc. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

We don’t have: - Rmisc - monocle - rhdf5 - cellrangerRkit-2.0.0

Load libraries

library(ggplot2)
library(cowplot) # Helps with creating publication-quality figures on top of ggplot2
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(monocle)
## Loading required package: Matrix
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:Matrix':
## 
##     which
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: DDRTree
## Loading required package: irlba
library(cellrangerRkit)
## Loading required package: RColorBrewer
## Loading required package: bit64
## Loading required package: bit
## 
## Attaching package: 'bit'
## The following object is masked from 'package:base':
## 
##     xor
## Attaching package bit64
## package:bit64 (c) 2011-2017 Jens Oehlschlaegel
## creators: integer64 runif64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bitstring
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] format print str
## values: is.na is.nan is.finite is.infinite
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## WARNING don't use as subscripts
## WARNING semantics differ from integer
## for more help type ?bit64
## 
## Attaching package: 'bit64'
## The following object is masked from 'package:Biobase':
## 
##     cache
## The following objects are masked from 'package:BiocGenerics':
## 
##     %in%, match, order, rank
## The following objects are masked from 'package:base':
## 
##     %in%, :, is.double, match, order, rank
## Loading required package: Rmisc
## Loading required package: lattice
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(HSMMSingleCell) # Single-cell RNA-Seq for differentiating human skeletal muscle myoblasts

For future reference, this is a list of single-cell data packages provided through Bioconductor: link.

scRNA Directory Set and Cell Ranger Object Generation

my_dir <- "./scRNA"
gbm <- load_cellranger_matrix(my_dir)
## Searching for genomes in: ./scRNA/outs/filtered_gene_bc_matrices 
## Using GRCh38 in folder: ./scRNA/outs/filtered_gene_bc_matrices/GRCh38 
## Loaded matrix information
## Loaded gene information
## Loaded barcode information
## Could not find summary csv: 
##   ./scRNA/outs/metrics_summary.csv.
## This file is only necessary if you are performing depth-normalization (calling the equalize_gbms function) in R.
## If this pipestance was produced by `cellranger aggr` with the default parameters, depth-normalization in R (via equalize_gbms) is not necessary.
# check the object gbm's class
class(gbm)
## [1] "GeneBCMatrix"
## attr(,"package")
## [1] "cellrangerRkit"
attr(gbm, "package")
## NULL
dim(exprs(gbm))
## [1] 33694  8381

exprs: these generic functions access the expression and error measurements of assay data stored in an object derived from the eSet class. eSet is a container class for high-throughput assays and experimental metadata. It holds one or more identically sized matrices as assayData elements, and derived classes (e.g., ExpressionSet, SnpSet) specify which elements must be present in the assayData slot.

eSet is a virtual class, so it cannot be instantiated directly. Objects created under previous definitions of eSet can be coerced to the current derived classes using updateOldESet.

An eSet object contains the slots assayData, phenoData, featureData, experimentData, annotation, protocolData, and .__classVersion__, and methods such as sampleNames, featureNames, dimnames, dims, phenoData, pData, varMetadata, varLabels, featureData, fData, assayData, experimentData, etc.

Aside from all this, there's an excellent site that explains single-cell genomics neatly. It's a blog run by the Rutgers Stem Cell Research Center led by Professor Hart. It says the 10X Chromium system has become the gold standard for single-cell sequencing, and therefore 10X Genomics' Cell Ranger software is widely used.

exprs(gbm): Retrieve Expression

exprs(gbm)[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                 AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 AAACCTGAGCTAGTGG-1
## ENSG00000243485                  .                  .                  .
## ENSG00000237613                  .                  .                  .
## ENSG00000186092                  .                  .                  .
## ENSG00000238009                  .                  .                  .
## ENSG00000239945                  .                  .                  .
##                 AAACCTGCACATTAGC-1 AAACCTGCACTGTTAG-1
## ENSG00000243485                  .                  .
## ENSG00000237613                  .                  .
## ENSG00000186092                  .                  .
## ENSG00000238009                  .                  .
## ENSG00000239945                  .                  .

pData(gbm): Retrieve experimental phenotypes (phenoData)

dim(pData(gbm))
## [1] 8381    1

head(pData(gbm))
##                               barcode
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1

fData(gbm): Retrieve Feature Data

head(fData(gbm))
##                              id        symbol
## ENSG00000243485 ENSG00000243485  RP11-34P13.3
## ENSG00000237613 ENSG00000237613       FAM138A
## ENSG00000186092 ENSG00000186092         OR4F5
## ENSG00000238009 ENSG00000238009  RP11-34P13.7
## ENSG00000239945 ENSG00000239945  RP11-34P13.8
## ENSG00000239906 ENSG00000239906 RP11-34P13.14

The main object class in Monocle is CellDataSet, and Monocle requires its data to be housed in one. CellDataSet extends Bioconductor's ExpressionSet class and supports the same basic interface. To get started we create a CellDataSet object with the newCellDataSet() function. Since the gbm object is also derived from ExpressionSet, this is straightforward. However, Monocle expects the gene-symbol column in the feature data to be called gene_short_name, or you will get a warning. (Make sure the gene-symbol column in the feature data is named gene_short_name.)

my_feat <- fData(gbm)

head(my_feat, 5)
##                              id       symbol
## ENSG00000243485 ENSG00000243485 RP11-34P13.3
## ENSG00000237613 ENSG00000237613      FAM138A
## ENSG00000186092 ENSG00000186092        OR4F5
## ENSG00000238009 ENSG00000238009 RP11-34P13.7
## ENSG00000239945 ENSG00000239945 RP11-34P13.8
names(my_feat) <- c('id', 'gene_short_name') # Gene Symbol column's name must be changed to gene_short_name
# in order to avoid warnings.

Generate CellDataSet object of R Monocle package with newCellDataSet()

my_cds <- newCellDataSet(exprs(gbm),
                         phenoData = new("AnnotatedDataFrame", data=pData(gbm)),
                         featureData = new("AnnotatedDataFrame", data=my_feat),
                         lowerDetectionLimit = 0.5,
                         expressionFamily = negbinomial.size())

negbinomial.size(): a negative binomial distribution family function with known size; it performs maximum-likelihood estimation of the mean parameter of a negative binomial distribution whose size parameter is known.

my_cds
## CellDataSet (storageMode: environment)
## assayData: 33694 features, 8381 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AAACCTGAGCATCATC-1 AAACCTGAGCTAACTC-1 ...
##     TTTGTCATCTGCTTGC-1 (8381 total)
##   varLabels: barcode Size_Factor
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000268674
##     (33694 total)
##   fvarLabels: id gene_short_name
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
my_cds <- estimateSizeFactors(my_cds)
my_cds <- estimateDispersions(my_cds)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Removing 146 outliers

Doesn’t work with deprecated dplyr. So I downgraded it.

Doesn’t work yet again, and the program complains about rlang version being too high. Let’s install rlang 0.3.0.

Whatever. It works well on Windows.

Clustering Cells that Do Not use Marker Genes

Single-cell sequencing data is high-dimensional: 20,000~30,000 genes measured across anywhere from hundreds of thousands to millions of cells. Computing on such high-dimensional data takes a lot of time and is vulnerable to various kinds of noise. We therefore reduce the roughly 30,000-gene space down to a few dozen dimensions. The Monocle package itself provides cell-clustering methods. Results tend to be good when marker genes are used, but when such prior knowledge isn't available we must perform clustering without selecting particular marker genes. Instead, we make use of the mean expression and variance of each gene across cells.

So, take the genes with mean_expression over 0.1. The setOrderingFilter() function designates which genes will be used for clustering.

The plot_ordering_genes() function plots mean expression level against the empirically estimated dispersion, and marks the gene set selected for cluster analysis with black dots for emphasis.

disp_table <- dispersionTable(my_cds)
head(disp_table)
##           gene_id mean_expression dispersion_fit dispersion_empirical
## 1 ENSG00000238009    0.0026983425      52.863362          0.000000000
## 2 ENSG00000279457    0.1548382003       1.068659          0.356959667
## 3 ENSG00000228463    0.0543320140       2.767983          0.005359107
## 4 ENSG00000237094    0.0022958750      62.104025         30.358663597
## 5 ENSG00000230021    0.0002830376     502.693229          0.000000000
## 6 ENSG00000237491    0.0245526857       5.943231          0.000000000
table(disp_table$mean_expression >= 0.1)
## 
## FALSE  TRUE 
## 15422  4012

So what’s a table?

class(table(disp_table$mean_expression >= 0.1))
## [1] "table"

A table object can be created by converting a factor into a table of the number of occurrences of each level in that factor.

https://www.cyclismo.org/tutorial/R/types.html#tables

Take only the subset genes of disp_table where the mean_expression is greater than 0.1.

unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)

Then, with setOrderingFilter, you designate which genes will be used for clustering. So you take gene_ids we obtained from subsetting for mean_expression bigger than 0.1 and select for those genes in my_cds (my cell data) and assign that data back in my_cds.

my_cds <- setOrderingFilter(my_cds, unsup_clustering_genes$gene_id)

Now, plot it with plot_ordering_genes(), which plots the empirical dispersion against mean_expression.

plot_ordering_genes(my_cds)
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

The clusterCells() function performs cluster analysis on a CellDataSet object. (The arguments here are arbitrarily set to 10 dimensions and 15 clusters.) The clustering results are stored in the phenoData table.

Now we reduce dimensionality through a process called reduceDimension.

So what’s tSNE?

t-SNE = t-distributed Stochastic Neighbor Embedding. It reduces dimensions so that two vectors that are similar in the high-dimensional space also end up close, distance-wise, in two-dimensional space, letting us visualize how the points cluster together. In other words, it is a means of visualizing high-dimensional similarity in a low-dimensional map. Refer to this blog article https://lovit.github.io/nlp/representation/2018/09/28/tsne/, as well as the paper where t-SNE was first proposed. This blog article is also a good mathematical take on manifold learning methods.

Actually there are many ways (algorithms) to reduce dimensionality: MDS (multi-dimensional scaling), ISOMAP, Locally Linear Embedding (LLE), t-SNE (and perhaps PCA).

These algorithms put emphasis on conservation of different information.

  1. MDS: wants points that are far apart in the original space to preserve that distance in low dimensions; it doesn't mind close points being pulled slightly apart.
This picture is from Identification of close Relatives in the HUGO Pan-Asian SNP Database, 2011, by Yang, et al.
  1. ISOMAP (isometric feature mapping): wants to represent a space that is like a rolled-up surface flattened out into a plane. The classic illustration is the Swiss Roll: the red and blue dots are very close in Euclidean distance but very far in geodesic distance. A geodesic is the shortest path between two points along a curved (non-Euclidean) surface or space. ISOMAP is one of the oldest and most basic manifold learning algorithms. It works as follows.
  • Find the L (hyper-parameter) nearest neighbors of each point and connect them in a graph.
  • Weight each edge by Euclidean distance.
  • Measure the graph distance from the i-th sample to the j-th sample, written \(d_{ij}\); these form the distance matrix \(D\).
  • Compute the Gram matrix \(G = -\frac{1}{2}HDH\), where \(H=I-\frac{1}{N}11^T\).
  • Eigen-decompose \(G\) into eigenpairs \((v, \lambda)\).
  • Project using the top K eigenpairs.
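The Gram-matrix and projection steps above (which ISOMAP shares with classical MDS) can be sketched in a few lines of numpy. This is a toy illustration only: it skips the neighbor-graph construction and uses plain squared Euclidean distances of four points on a line, so the 1-D embedding should recover the original spacing.

```python
import numpy as np

# Toy squared-distance matrix D for 4 points on a line at 0, 1, 2, 3.
coords = np.array([0.0, 1.0, 2.0, 3.0])
D = (coords[:, None] - coords[None, :]) ** 2   # squared pairwise distances

N = D.shape[0]
H = np.eye(N) - np.ones((N, N)) / N            # centering matrix H = I - (1/N) 11^T
G = -0.5 * H @ D @ H                           # Gram matrix G = -1/2 H D H

# Eigen-decompose G and project onto the top-K eigenpairs.
eigvals, eigvecs = np.linalg.eigh(G)           # ascending order
order = np.argsort(eigvals)[::-1]
K = 1
top_vals = eigvals[order[:K]]
top_vecs = eigvecs[:, order[:K]]
embedding = top_vecs * np.sqrt(top_vals)       # 1-D embedding coordinates

# Pairwise spacing in the embedding matches the original (up to sign/shift).
recovered = np.abs(embedding[:, 0] - embedding[0, 0])
print(np.round(recovered, 6))
```

Running this recovers the spacing 0, 1, 2, 3, which is exactly what the double-centering trick is supposed to do when the distances are Euclidean.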
  1. LLE (locally linear embedding): learns the manifold with a linear model based on geometric intuition. High-dimensional data are non-linear if you look at them globally, but we assume they are locally linear, just as in ISOMAP. The original data \(D=[x_1,x_2,...,x_N], x \in R^d\) are mapped to \([z_1, z_2, ... , z_N], z \in R^k\).

  1. Stochastic Neighbor Embedding

Now, LLE accounts for the closest k points but ignores all the others. The overall structure could in fact be preserved, but there is no guarantee. To address this, we consider points that are close in the original space while also taking the farther points into account. The following excerpt is from the t-SNE paper.

“In particular, most of the techniques are not capable of retaining both the local and the global structure of the data in a single map.”

Most of the previously mentioned techniques cannot retain both the local and global structure of the data, but t-SNE can. It has a parameter called perplexity, rather than k as in LLE, and it does not react too sensitively to that parameter: there is a range of values known empirically to work well. Such robustness in retaining the data structure while projecting it onto a 2D map is why we use t-SNE for dimensionality reduction.

How to embed high dimensional vectors to low dimensional space

A similarity \(p_{ij}\) is defined on the original data and a similarity \(q_{ij}\) on the embedding space. To get \(p_{ij}\), you first calculate the similarity of point \(x_j\) to \(x_i\), denoted \(p_{j|i}\). This similarity takes the form of a probability. To define it, first compute the Euclidean distance \(|x_i-x_j|\) from the reference point \(x_i\) to every other point; based on this distance, express the closeness of \(x_i\) and \(x_j\) as a probability.

\(\sigma_i\) denotes a standard deviation: divide the squared distance between \(x_i\) and \(x_j\) by \(2\sigma_i^2\) and take the negative exponential.

\(exp(\frac{-|x_i-x_j|^2}{2\sigma_i^2})\)

This distribution is non-negative at all values, and it becomes bigger and bigger as the two points are closer together.

If you divide this value by its sum over all other points, \(\sum_{k \ne i} exp(\frac{-|x_i-x_k|^2}{2\sigma_i^2})\), you get something that behaves like a probability.

\[p_{j|i}=\frac{exp(\frac{-|x_i-x_j|^2}{2\sigma_i^2})}{\sum_{k \ne i} exp(\frac{-|x_i-x_k|^2}{2\sigma_i^2})}\]

This \(p_{j|i}\) plays the role of a stochastic model, i.e., a model in which a point moves to another point at each time step according to some probability distribution. Here we define \(p_{j|i}\) as the probability that point \(x_i\) moves to \(x_j\) at the next time step.

Note that \(p_{j|i}\) is not necessarily equal to \(p_{i|j}\). Therefore, for symmetry, we define the arithmetic average as the similarity between two points. And then because there are total \(n\) points, we can make the entire sum of similarities of all points equal to 1.

\[p_{ij} = \frac{p_{i|j}+p_{j|i}}{2n}\]

You could also define \(p_{ij}\) as below, using an identical \(\sigma\) for all \(x_i\). This can be problematic for datasets with outliers, though.

\[p_{ij}=\frac{exp(-|x_i-x_j|^2/2\sigma ^2)}{\sum _{k \ne l}exp(-|x_k-x_l|^2/2\sigma ^2)}\]

Now that we have defined the similarity \(p_{ij}\) in the original space, we define the similarity \(q_{ij}\) in the embedding space. It is likewise defined so that \(\sum _{i,j}q_{ij}=1\). For example, you could use \((|y_i-y_j|^2)^{-1}\) as the similarity, but since the inverse of something infinitesimally small blows up toward infinity, we add 1 to obtain the stable inverse \((1+|y_i-y_j|^2)^{-1}\).

\[q_{ij} = \frac{(1+|y_i-y_j|^2)^{-1}}{\sum _{k \ne l}(1+|y_k-y_l|^2)^{-1}}\]

Having defined similarities in both spaces (original and embedding), t-SNE learns positions \(y_i\), \(y_j\) such that the resulting \(q_{ij}\) comes close to \(p_{ij}\). It is like moving \(y_i\) and \(y_j\) around while watching the target \(p_{ij}\).

We use gradient descent in t-SNE for learning: it moves the points \(y\) around at the rate of change given by the gradient:

\[\frac{\partial C}{\partial y_i} = 4\sum _j (p_{ij}-q_{ij})(y_i-y_j) \frac{1}{1+|y_i-y_j|^2}\]
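Putting the pieces together, here is a self-contained numpy sketch (not a production t-SNE): \(p_{j|i}\) with a fixed \(\sigma\), the symmetrized \(p_{ij}\), the Student-t \(q_{ij}\), and plain gradient descent on the KL divergence. Real t-SNE additionally tunes \(\sigma_i\) per point via the perplexity and uses momentum; those refinements are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))    # 6 toy points in 5 dimensions
n = X.shape[0]
sigma = 1.0                    # fixed sigma; real t-SNE tunes sigma_i per point via perplexity

# p_{j|i}: conditional similarities in the original space
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
P_cond = np.exp(-D / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)                  # exclude k = i from the sum
P_cond /= P_cond.sum(axis=1, keepdims=True)

# symmetrize: p_ij = (p_{i|j} + p_{j|i}) / 2n, so the whole matrix sums to 1
P = (P_cond + P_cond.T) / (2 * n)

def q_matrix(Y):
    # Student-t similarities q_ij in the embedding space
    Dy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = 1.0 / (1.0 + Dy)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl(P, Q):
    return float((P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12))).sum())

Y = rng.normal(size=(n, 2)) * 1e-2             # small random initialization
kl_start = kl(P, q_matrix(Y)[0])

for _ in range(300):                           # plain gradient descent on KL(P||Q)
    Q, W = q_matrix(Y)
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j) / (1 + |y_i - y_j|^2)
    grad = 4.0 * (((P - Q) * W)[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
    Y -= 10.0 * grad

kl_end = kl(P, q_matrix(Y)[0])
print(round(kl_start, 4), round(kl_end, 4))
```

The KL divergence drops as the 2-D positions are pushed toward a configuration whose \(q_{ij}\) matches \(p_{ij}\), which is exactly the learning process described above.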

All these concepts, and more, are nicely explained in this article.

Now that we’ve covered the very basic concepts… let’s get to it.

# Reduce dimensionality

my_cds <- reduceDimension(my_cds, max_components = 2, num_dim = 10, reduction_method = 'tSNE', verbose = TRUE)
## Remove noise by PCA ...
## Reduce dimension by tSNE ...

The reduceDimension() function uses both PCA and t-SNE.
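The verbose output above ("Remove noise by PCA ... Reduce dimension by tSNE") suggests a two-stage pipeline: denoise with PCA, then embed the PCA scores with t-SNE. As a hedged sketch of the PCA stage only (an assumption about the internals, with toy data), centering the matrix and taking the top num_dim components via SVD looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
expr = rng.poisson(2.0, size=(100, 50)).astype(float)  # 100 toy cells x 50 toy genes

X = expr - expr.mean(axis=0)            # center each gene
U, S, Vt = np.linalg.svd(X, full_matrices=False)
num_dim = 10
scores = U[:, :num_dim] * S[:num_dim]   # cell coordinates in the top-10 PC space

# t-SNE (or any embedding) would then run on `scores` instead of the raw matrix.
var_kept = (S[:num_dim] ** 2).sum() / (S ** 2).sum()
print(scores.shape, round(var_kept, 3))
```

Running t-SNE on a few dozen PCA scores instead of tens of thousands of raw gene counts is both faster and less noisy, which is presumably why the two steps are chained.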

my_cds <- clusterCells(my_cds, num_clusters=15)
## Distance cutoff calculated to 4.315355

Why is it different from the book? Idk.

head(pData(my_cds))
##                               barcode Size_Factor num_genes_expressed  UMI
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871 2394
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806 1694
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316 4520
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898 2788
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526 4667
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495 4440
##                    Cluster peaks  halo     delta      rho
## AAACCTGAGCATCATC-1       2 FALSE FALSE 0.1649455 113.7337
## AAACCTGAGCTAACTC-1       6 FALSE  TRUE 0.4159332 147.4402
## AAACCTGAGCTAGTGG-1       4 FALSE  TRUE 0.3108593 130.9082
## AAACCTGCACATTAGC-1      10 FALSE  TRUE 0.7702480 123.1802
## AAACCTGCACTGTTAG-1       3 FALSE  TRUE 0.4021566 144.6951
## AAACCTGCATAGTAAG-1       3 FALSE  TRUE 0.3091744 142.0625
##                    nearest_higher_density_neighbor
## AAACCTGAGCATCATC-1                              NA
## AAACCTGAGCTAACTC-1                              NA
## AAACCTGAGCTAGTGG-1                              NA
## AAACCTGCACATTAGC-1                              NA
## AAACCTGCACTGTTAG-1                              NA
## AAACCTGCATAGTAAG-1                              NA
my_cluster_dim_10 <- pData(my_cds)$Cluster
plot_cell_clusters(my_cds)

Single-cell genomics differential expression data analysis

The main interest in single-cell transcriptome analysis is finding novel markers that show different expression patterns across cell types. Discovering marker genes from the cells' expression levels can be done with the differentialGeneTest() function. For this job, create a character vector with 'yes' for cells belonging to cluster 1 and 'no' otherwise. You can then see which genes are differentially expressed in cluster 1 relative to all the other cells.

my_vector <- rep('no', nrow(pData(my_cds))) # start with 'no' repeated once per cell
my_vector[pData(my_cds)$Cluster == 1] <- rep('yes', sum(pData(my_cds)$Cluster == 1))

# where the Cluster value equals 1, overwrite those entries of my_vector with that many 'yes' values.

Make a new column named test in the phenotype data of my_cds and put the my_vector values there, one per cell (sample).

pData(my_cds)$test <- my_vector

head(pData(my_cds))
##                               barcode Size_Factor num_genes_expressed  UMI
## AAACCTGAGCATCATC-1 AAACCTGAGCATCATC-1   0.5672842                 871 2394
## AAACCTGAGCTAACTC-1 AAACCTGAGCTAACTC-1   0.4014116                 806 1694
## AAACCTGAGCTAGTGG-1 AAACCTGAGCTAGTGG-1   1.0710628                1316 4520
## AAACCTGCACATTAGC-1 AAACCTGCACATTAGC-1   0.6606467                 898 2788
## AAACCTGCACTGTTAG-1 AAACCTGCACTGTTAG-1   1.1058961                1526 4667
## AAACCTGCATAGTAAG-1 AAACCTGCATAGTAAG-1   1.0521060                1495 4440
##                    Cluster peaks  halo     delta      rho
## AAACCTGAGCATCATC-1       2 FALSE FALSE 0.1649455 113.7337
## AAACCTGAGCTAACTC-1       6 FALSE  TRUE 0.4159332 147.4402
## AAACCTGAGCTAGTGG-1       4 FALSE  TRUE 0.3108593 130.9082
## AAACCTGCACATTAGC-1      10 FALSE  TRUE 0.7702480 123.1802
## AAACCTGCACTGTTAG-1       3 FALSE  TRUE 0.4021566 144.6951
## AAACCTGCATAGTAAG-1       3 FALSE  TRUE 0.3091744 142.0625
##                    nearest_higher_density_neighbor test
## AAACCTGAGCATCATC-1                              NA   no
## AAACCTGAGCTAACTC-1                              NA   no
## AAACCTGAGCTAGTGG-1                              NA   no
## AAACCTGCACATTAGC-1                              NA   no
## AAACCTGCACTGTTAG-1                              NA   no
## AAACCTGCATAGTAAG-1                              NA   no
length(unsup_clustering_genes$gene_id)
## [1] 4012
# Differential expression of cluster 1
de_cluster_one <- differentialGeneTest(my_cds[unsup_clustering_genes$gene_id,],
                                       fullModelFormulaStr = '~test',
                                       cores = 4)

differentialGeneTest usage:

differentialGeneTest(cds, fullModelFormulaStr="~?????", cores=# (1-4)) https://rdrr.io/bioc/monocle/man/differentialGeneTest.html

The model formula string specifies which covariate the differential test is performed against. Here we test against the test column (hence ~test as fullModelFormulaStr), whose values are yes or no.

The function tests each gene for differential expression, i.e., whether it changes as a function of pseudotime or according to other specified covariates. It is Monocle's main differential analysis routine: it accepts a CellDataSet and two model formulae as input, which specify generalized lineage models as implemented by the VGAM package. VGAM stands for vector generalized linear and additive models, an implementation of about six major classes of statistical regression models. The central algorithm is Fisher scoring with iteratively reweighted least squares. At the very heart of the package are data-driven VGLMs that use smoothing. The book "Vector Generalized Linear and Additive Models: With an Implementation in R" (Yee, 2015) gives details of the statistical framework and the package.
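To make the full-versus-reduced model comparison concrete, here is a toy likelihood-ratio test for a single gene in Python, simplified to a Poisson likelihood with made-up counts (Monocle's actual test fits negbinomial.size() models via VGAM, which adds a dispersion parameter; this is only the shape of the idea):

```python
from math import log, erfc, sqrt

def poisson_loglik(counts, mean):
    # log-factorial terms cancel in the likelihood ratio, so they are omitted
    return sum(c * log(mean) - mean for c in counts)

in_cluster = [5, 7, 6, 8, 9]      # toy counts for cells with test == 'yes'
background = [1, 0, 2, 1, 1, 2]   # toy counts for cells with test == 'no'

# full model: each group gets its own mean (expression depends on 'test')
ll_full = poisson_loglik(in_cluster, sum(in_cluster) / len(in_cluster)) + \
          poisson_loglik(background, sum(background) / len(background))

# reduced model: one shared mean (expression does not depend on 'test')
all_counts = in_cluster + background
ll_null = poisson_loglik(all_counts, sum(all_counts) / len(all_counts))

stat = 2 * (ll_full - ll_null)    # ~ chi-squared with 1 degree of freedom
pval = erfc(sqrt(stat / 2))       # chi-squared survival function for df = 1
print(round(stat, 2), pval)
```

The gene is called differentially expressed when the full model explains the counts so much better that the statistic is improbably large under the reduced model, which is the test differentialGeneTest() runs per gene.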

dim(de_cluster_one)
## [1] 4012    8

Then we order the differentially expressed genes by q-value (the p-value adjusted for multiple testing, which reflects statistical significance).
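As an aside on what a q-value column typically contains: assuming a Benjamini-Hochberg-style adjustment (the same thing p.adjust(p, method = "BH") computes in R), a minimal sketch of the procedure is:

```python
# Benjamini-Hochberg adjustment: q_i = min over ranks >= rank(i) of p * m / rank
def bh_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    prev = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvals[i] * m / rank)
        qvals[i] = q
        prev = q
    return qvals

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print([round(q, 4) for q in bh_adjust(pvals)])
```

Sorting by q-value, as done below with arrange(qval), therefore ranks genes by significance after controlling the false discovery rate.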

de_cluster_one %>% arrange(qval) %>% head()
##   status           family          pval          qval              id
## 1     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000163131
## 2     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000204287
## 3     OK negbinomial.size  0.000000e+00  0.000000e+00 ENSG00000196126
## 4     OK negbinomial.size 1.204779e-319 9.864436e-318 ENSG00000145649
## 5     OK negbinomial.size 4.354764e-284 3.176602e-282 ENSG00000138795
## 6     OK negbinomial.size 2.162205e-236 1.188324e-234 ENSG00000182287
##   gene_short_name num_cells_expressed use_for_ordering
## 1            CTSS                5547             TRUE
## 2         HLA-DRA                6195             TRUE
## 3        HLA-DRB1                5303             TRUE
## 4            GZMA                2012             TRUE
## 5            LEF1                2519             TRUE
## 6           AP1S2                3697             TRUE

It shows that CTSS (Ensembl gene id: ENSG00000163131) gene is the most significantly differentially expressed gene.

Now we can visualize the gene’s expression level according to the cluster with plot_genes_jitter().

plot_genes_jitter(my_cds['ENSG00000163131',], grouping = "Cluster")

You can see that clusters 3, 6, and 13 show statistically significant differential expression of the CTSS gene relative to cluster 1.

You can visualize this cluster on the plot with plot_cell_clusters yet again, but now with clusters 3, 6, 13 vs. others.

pData(my_cds)$my_colour <- pData(my_cds)$Cluster == 3 | pData(my_cds)$Cluster == 6 | pData(my_cds)$Cluster == 13
plot_cell_clusters(my_cds, color_by = "my_colour")

Reconstructing Single-cell Trajectories

Clustering is a classification process where you take similar cells and group them together into bundles. But that's not the end of the story. You can look at the clusters and connect them to each other, again based on expression-pattern similarity. The idea is that developing and differentiating cells change their states fairly continuously (though if the time interval or resolution is too coarse, or if a drastic differentiation takes place rapidly, the cells will appear discrete, so you have to be careful), and cells occupy a vast, complex landscape of possible states rather than being precisely defined, cleanly cut-out types. If you connect the states together, they look like a smooth cylinder in space. (I'm not sure exactly what this means. \(\rightarrow\) Refer to the bioRxiv article "The transcriptome dynamics of single cells during the cell cycle".)

By simplifying this connectivity (the relationships), finding the central axis, and arranging the cells along it, one can reconstruct the trajectory of cellular differentiation, i.e., how expression changes (based on the few variables, or genes, that are the main components in the reduced dimension). One can think of this as the time axis projected onto a spatial axis, so people often call it "pseudo-time" analysis. There are various pseudo-time analysis methods, and it is advisable to try several, because there exist as many optimal ways of tackling this problem as there is structural diversity.

Example.

The trajectory above shows the trajectory followed by olfactory neurons as they develop in mice, from progenitor cells to precursor cells to immature olfactory neurons to mature ones. Each colored point represents a cell, which are connected into a minimum spanning tree. Minimum Spanning Tree or MST is the core data structure Monocle uses to find the trajectory, shown in black lines. Each cell’s pseudotime value is measured as the distance along the trajectory from its position back to the beginning. (Traverse the line and measure the trajectory distance… this is what you get as the measurement of pseudotime.) Refer to this page at Trapnell lab that developed Monocle. Also, “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells”
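The "distance along the MST back to the beginning" idea can be sketched end to end in plain Python (toy 2-D coordinates standing in for a reduced-dimension embedding, Prim's algorithm for the MST; this is an illustration of the concept, not Monocle's code):

```python
import heapq
from math import dist

# Toy "cells" in a 2-D reduced space, lying roughly along a curve with one branch.
cells = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.1), (3.0, 0.4), (1.2, 1.5)]

def mst_edges(points):
    # Prim's algorithm over Euclidean distances on the complete graph
    n = len(points)
    in_tree, edges = {0}, []
    heap = [(dist(points[0], points[j]), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while len(in_tree) < n:
        d, i, j = heapq.heappop(heap)
        if j in in_tree:
            continue                      # stale edge into the tree
        in_tree.add(j)
        edges.append((i, j, d))
        for k in range(n):
            if k not in in_tree:
                heapq.heappush(heap, (dist(points[j], points[k]), j, k))
    return edges

def pseudotime(points, root=0):
    # pseudotime of each cell = distance along the tree from the root cell
    adj = {i: [] for i in range(len(points))}
    for i, j, d in mst_edges(points):
        adj[i].append((j, d))
        adj[j].append((i, d))
    times = {root: 0.0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v, d in adj[u]:
            if v not in times:
                times[v] = times[u] + d
                stack.append(v)
    return [times[i] for i in range(len(points))]

print([round(t, 3) for t in pseudotime(cells)])
```

The cell at the far end of the main curve gets the largest pseudotime, and the branch cell gets a value measured along its own path through the tree, mirroring how a branching trajectory is read.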

The tutorial describes Monocle like this: Monocle uses an algorithm to learn the sequence of gene expression changes each cell must go through as part of a dynamic biological process. Once it has learned the overall "trajectory" of gene expression changes, Monocle can place each cell at its proper position in the trajectory.

In this practice, we use an algorithm comprising dimensionality reduction, MST building, ordering cells in pseudotime, and labeling cells, in order to discover the chronological (?) order of gene expression. The trajectory analysis consists of three main stages:

  1. Choose genes that define progress
  2. Reduce the dimensionality of the data
  3. Order cells in pseudotime

… And you might add, labeling cells.

For the trajectory analysis, I will use only a subset of the cells (clusters 3, 6, and 13, designated by pData(my_cds)$my_colour) and only genes that are expressed in at least 10 cells.

expressed_genes <- row.names(subset(fData(my_cds), num_cells_expressed >= 10))

# my_colour is TRUE if a cell belongs to either cluster 3, 6, or 13.
my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]

# 15446 genes & 1904 cells
my_cds_subset
## CellDataSet (storageMode: environment)
## assayData: 15446 features, 1904 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AAACCTGAGCTAACTC-1 AAACCTGCACTGTTAG-1 ...
##     TTTGTCATCGTCTGAA-1 (1904 total)
##   varLabels: barcode Size_Factor ... my_colour (12 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSG00000238009 ENSG00000279457 ... ENSG00000271254
##     (15446 total)
##   fvarLabels: id gene_short_name num_cells_expressed use_for_ordering
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

So we got 15446 genes (features) and 1904 cells (samples).

We'll use the recommended approach of ordering based on genes that differ between clusters. First, we subset the genes again, keeping only those expressed in more than 5% of the cell population.

Start by inspecting the cell barcodes (columns) and gene IDs (rows) of the subset:

# samples (each cell)

head(colnames(my_cds_subset))
## [1] "AAACCTGAGCTAACTC-1" "AAACCTGCACTGTTAG-1" "AAACCTGCATAGTAAG-1"
## [4] "AAACCTGGTAGAAGGA-1" "AAACCTGGTTTGCATG-1" "AAACCTGTCTCGCTTG-1"
# genes

head(rownames(my_cds_subset))
## [1] "ENSG00000238009" "ENSG00000279457" "ENSG00000228463" "ENSG00000237094"
## [5] "ENSG00000237491" "ENSG00000225880"
# Subset again: rows = genes expressed in at least 10 cells,
# columns = cells in clusters 3, 6, and 13 (my_colour == TRUE; these are the
# clusters with significant CTSS differential expression).
my_cds_subset <- my_cds[expressed_genes, pData(my_cds)$my_colour]

my_cds_subset <- detectGenes(my_cds_subset, min_expr = 0.1)
# detectGenes recounts, per gene, the number of cells expressing it above min_expr = 0.1

fData(my_cds_subset)$use_for_ordering <- fData(my_cds_subset)$num_cells_expressed > 0.05*ncol(my_cds_subset)
# A Boolean goes into use_for_ordering, marking genes that are expressed in more than 5% of the total cell population.

table(fData(my_cds_subset)$use_for_ordering)
## 
## FALSE  TRUE 
##  9358  6088
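For intuition, the same more-than-5%-of-cells filter can be sketched in Python on a toy count matrix (rows = genes, columns = cells). This only mirrors the idea behind the use_for_ordering step, not Monocle's internals:

```python
# Mirror of the use_for_ordering step on a toy matrix (rows = genes, cols = cells):
# keep genes detected in more than 5% of the cell population.
import numpy as np

counts = np.array([
    [0, 0, 3, 0, 0, 0, 0, 0, 0, 0],  # detected in 1/10 cells
    [2, 1, 0, 4, 5, 0, 1, 2, 0, 3],  # detected in 7/10 cells
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never detected
])
num_cells_expressed = (counts > 0).sum(axis=1)
use_for_ordering = num_cells_expressed > 0.05 * counts.shape[1]  # > 5% of cells
```

Note that with only 10 toy cells the cutoff is 0.5 cells, so even a gene seen once passes; with the real 1,904 cells the cutoff is roughly 95 cells.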

Reduce dimensions

my_cds_subset <- reduceDimension(my_cds_subset,
                                 max_components = 2,
                                 norm_method = 'log',
                                 num_dim = 10,
                                 reduction_method = 'tSNE',
                                 verbose = TRUE)
## Remove noise by PCA ...
## Reduce dimension by tSNE ...
my_cds_subset <- clusterCells(my_cds_subset, verbose = FALSE)
## Distance cutoff calculated to 3.128448

rho_threshold: The threshold of local density (rho) used to select the density peaks

delta_threshold: The threshold of local distance (delta) used to select the density peaks

skip_rho_sigma: A logical flag determining whether or not to skip the calculation of rho / sigma

my_cds_subset <- clusterCells(my_cds_subset,
                              rho_threshold = 2,
                              delta_threshold = 10,
                              skip_rho_sigma = T,
                              verbose=TRUE)
## Use the user provided rho and delta for assigning the density peaks and clusters
plot_cell_clusters(my_cds_subset)

There are five clusters in total, and the cluster assignments are stored in my_cds_subset$Cluster.

Run the DEG analysis as we did before.

head(my_cds_subset$Cluster)
## [1] 2 3 3 3 3 4
## Levels: 1 2 3 4 5
clustering_DEG_genes <- differentialGeneTest(my_cds_subset,
                                             fullModelFormulaStr = '~Cluster',
                                             cores = 4)
# Differential expression analysis according to the Cluster column (what we got from clusterCells)
dim(clustering_DEG_genes)
## [1] 15446     8
clustering_DEG_genes %>% arrange(qval) %>% head()
##   status           family pval qval              id gene_short_name
## 1     OK negbinomial.size    0    0 ENSG00000163220          S100A9
## 2     OK negbinomial.size    0    0 ENSG00000163221         S100A12
## 3     OK negbinomial.size    0    0 ENSG00000143546          S100A8
## 4     OK negbinomial.size    0    0 ENSG00000203747          FCGR3A
## 5     OK negbinomial.size    0    0 ENSG00000163736            PPBP
## 6     OK negbinomial.size    0    0 ENSG00000090382             LYZ
##   num_cells_expressed use_for_ordering
## 1                1893             TRUE
## 2                1593             TRUE
## 3                1879             TRUE
## 4                 339             TRUE
## 5                 102             TRUE
## 6                1901             TRUE

Order the genes by significance of differential expression (smallest q-value first), reduce dimensions, then run the pseudotime analysis to follow the trajectory of cell development/differentiation.

my_ordering_genes <- row.names(clustering_DEG_genes)[order(clustering_DEG_genes$qval)][1:1000]

# You can also write it as:
# row.names(clustering_DEG_genes[order(clustering_DEG_genes$qval), ])[1:1000]

With this list of the top 1,000 most significantly differentially expressed genes:

my_cds_subset <- setOrderingFilter(my_cds_subset, ordering_genes = my_ordering_genes)
my_cds_subset <- reduceDimension(my_cds_subset, method='DDRTree')

DDRTree is an R implementation of the DDRTree algorithm from Qi Mao, Li Wang, Steve Goodison, and Yijun Sun, "Dimensionality Reduction via Graph Structure Learning," The 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '15), 2015 (http://dl.acm.org/citation.cfm?id=2783309).

Its rdrr page is over here.

my_cds_subset <- orderCells(my_cds_subset)
plot_cell_trajectory(my_cds_subset, color_by = "Cluster")

Q. Why does the plot look completely different in the R Markdown notebook and in the textbook hardcopy? (Most likely because tSNE and the density-peak clustering involve randomness; without a fixed random seed, the embedding, and hence the trajectory layout, differs between runs.)

What it looks like in markdown
What it looks like in the book

4. RNA Editing Data Analysis

As previously mentioned, RNA editing is a change of bases that occurs after transcription and before translation, while the RNA exists as a free molecule. RNA editing can be classified into two categories: substitution-like editing (by deamination; mostly C-to-U and A-to-I) and indels. In substitution-like RNA editing, a nucleotide is chemically modified into another base. ADAR (adenosine deaminase acting on RNA) acts on RNA nucleotides and converts adenosine to inosine (A-to-I); this accounts for about 90% of all RNA editing events. RNA editing occurs throughout RNA regions, from the 5'UTR, 3'UTR, and non-coding RNA to coding regions, and is known to occur frequently in Alu regions. RNA editing affects the function of many transcripts, especially alternative splicing and gene expression; it is an important molecular mechanism for regulating protein function and is associated with various human disease phenotypes. RNA editing analysis using RNA-seq can be classified into two types:

  1. The first is discovering novel RNA editing events by comparing RNA-Seq and DNA-Seq data from the same sample. VCF files from RNA-seq and DNA-seq are compared side by side, and a base where chromosome, position, and reference allele are the same but the variant alleles differ is reported as an RNA-DNA Difference (RDD). Called RDDs are RNA editing candidates.

  2. The second is comparing RNA positions against the reference sequence alone (simple resequencing) to find RDDs, then confirming them by mapping to known RNA editing positions. The DARNED database (DAtabase of RNa EDiting in humans) provides about 42,000 known RNA editing positions.
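As a side note, candidate RDDs can be roughly classified by their REF>ALT pair: inosine base-pairs like guanosine, so A-to-I editing appears in sequencing data as A>G (or T>C on the reverse strand), while C-to-U appears as C>T (or G>A). A minimal sketch of that classification (the helper rdd_type is my own, not part of any tool):

```python
# Rough classification of RDD substitution types by their REF>ALT pair.
# Inosine is read as guanosine by sequencers, so A-to-I editing shows up
# as A>G (T>C on the reverse strand); C-to-U shows up as C>T (G>A).
from collections import Counter

def rdd_type(ref, alt):
    """Label a REF>ALT difference with its likely editing class (hypothetical helper)."""
    pair = (ref.upper(), alt.upper())
    if pair in {("A", "G"), ("T", "C")}:
        return "A-to-I candidate"
    if pair in {("C", "T"), ("G", "A")}:
        return "C-to-U candidate"
    return "other"

sites = [("A", "G"), ("T", "C"), ("C", "T"), ("G", "T")]
type_counts = Counter(rdd_type(ref, alt) for ref, alt in sites)
```

In real data, an excess of A>G calls over the other substitution types is one quick sanity check that a candidate set is enriched for genuine A-to-I editing rather than noise.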

After finding RDDs and selecting RNA editing positions, we need to annotate them. ExpEdit (http://www.caspur.it/ExpEdit/) and RCARE (RNA-Seq Comparison and Annotation for RNA Editing; http://www.snubi.org/software/rcare/) exist for this purpose. ExpEdit only supports hg18, so it is of limited use today. RCARE uses the hg19 reference genome and can both annotate biological functions and visualize the analysis. In this exercise, we will use HeLa cell line data, which originates from cervical cancer, to learn how to analyze RNA editing information with RCARE.

Here's a review article on RNA editing and on RNA editing analysis and annotation pipelines, by Su Youn Lee and Prof. Ju Han Kim: Bioinformatics Approaches for the Identification and Annotation of RNA Editing Sites. It notes that many novel RNA editing discovery tools have been developed with the advent of high-throughput NGS technology, but each pipeline has distinct shortcomings; in particular, false positives have been a problem, partly due to inherent limitations of currently available sequencing technologies and partly due to the incompleteness of bioinformatics analysis methods.

So let's look at the RNA editing pipeline. This picture is from the aforementioned review paper.

Steps for identifying RNA editing sites. (A) Identification of novel RNA editing sites using both DNA and RNA sequences. (B) Annotating RNA-Seq data with known RNA editing sites
  • Discovery of novel RNA editing sites: Compare RNA and DNA VCFs and obtain RNA-DNA differences (RDDs). After detecting RDDs, filter out known SNPs and known RNA editing sites using 1000 genome, HapMap, dbSNP for the former and DARNED for the latter.

Many RNA editing sites have been reported in a variety of human diseases; to list a few, RNA editing has been reported in epilepsy, brain ischemia, depression, and brain tumors.3 At the same time, RNA editing appears to be one of the regulatory processes that ensure biological diversity in our bodies.

The apolipoprotein B (human APOB) gene is a representative case of one gene producing such diverse phenotypes. APOB consists of 29 exons and 4,564 codons in total, including the stop codon. The gene is expressed in liver cells (hepatocytes) and in the gastrointestinal tract (especially intestinal cells of the duodenum, small intestine, and colon). In liver cells, all 4,563 coding codons are translated into protein; this is ApoB-100, the largest of the apoB group of proteins. In intestinal cells, however, the 2,153rd codon (CAA) is converted to UAA by a cytidine deaminase (the trans-acting, tissue-specific RNA editing enzyme that determines which isoform is ultimately produced), creating a premature stop codon; this yields ApoB-48. ApoB-48 and ApoB-100 therefore share a common N-terminal sequence, but ApoB-48 lacks ApoB-100's C-terminal LDL receptor binding region and so does not bind to LDL receptors. ApoB-48 exists exclusively in small intestine chylomicrons.^[http://www.incodom.kr/RNA_editing]
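The effect of that single C-to-U edit can be illustrated with a toy translation loop. The five-codon "transcript" below is made up for illustration; only the CAA-to-UAA mechanism itself is from the text:

```python
# Toy illustration of the APOB edit: C-to-U deamination turns codon CAA (Gln)
# into the stop codon UAA, truncating translation (the ApoB-48 vs ApoB-100 idea).
STOP_CODONS = {"UAA", "UAG", "UGA"}

def codons_translated(codons):
    """Count codons translated before hitting the first stop codon."""
    n = 0
    for codon in codons:
        if codon in STOP_CODONS:
            break
        n += 1
    return n

liver_mrna = ["AUG", "GCU", "CAA", "GGU", "UUU"]  # made-up 5-codon "transcript"
intestine_mrna = list(liver_mrna)
intestine_mrna[2] = "UAA"                         # C-to-U edit: CAA -> UAA

long_isoform = codons_translated(liver_mrna)      # full-length, like ApoB-100
short_isoform = codons_translated(intestine_mrna) # truncated, like ApoB-48
```

The unedited transcript is translated end to end, while the edited one stops at the new UAA, giving two isoforms with a shared N-terminus, just as with ApoB-100 and ApoB-48.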

Here is a list of resources for RNA editing research.

RDD (RNA-DNA difference) site identification practice dataset and analytical tool

The practice dataset is ENCODE (http://genome.ucsc.edu/ENCODE) data of HeLa cell line’s DNA and RNA VCF (partial).

[soh1@x020 RDD]$ head DNA.vcf
##fileformat=VCFv4.1
#CHR    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HeLa-S3
1   89923   .   A   T   31.8    PASS    DP=3;VDB=6.240000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33  GT:PL:GQ    1/1:63,6,0:10
1   134667  .   A   G   87.3    PASS    DP=5;VDB=6.970230e-02;AF1=1;AC1=2;DP4=0,0,4,1;MQ=50;FQ=-42  GT:PL:GQ    1/1:120,15,0:27
1   138875  .   G   A   21  PASS    DP=9;VDB=8.150059e-02;RPB=0.000000e+00;AF1=0.5002;AC1=1;DP4=3,0,3,1;MQ=46;FQ=6.16;PV4=1,1,0.22,1    GT:PL:GQ    0/1:51,0,32:35
1   321061  .   C   G   26  PASS    DP=21;VDB=8.828460e-02;RPB=-1.030420e+00;AF1=0.5;AC1=1;DP4=9,1,6,0;MQ=41;FQ=29;PV4=1,1,0.46,1   GT:PL:GQ    0/1:56,0,91:59
1   564461  .   A   G   27.8    PASS    DP=2;VDB=5.960000e-02;AF1=1;AC1=2;DP4=0,0,2,0;MQ=50;FQ=-33  GT:PL:GQ    1/1:59,6,0:10
1   564467  .   A   G   23  PASS    DP=4;VDB=5.960000e-02;RPB=8.745357e-01;AF1=1;AC1=2;DP4=1,0,2,0;MQ=41;FQ=-30;PV4=1,0.23,1,1  GT:PL:GQ    1/1:53,3,0:6
1   567242  .   G   C   222 PASS    DP=222;VDB=4.677715e-02;RPB=9.763200e+00;AF1=1;AC1=2;DP4=41,0,120,7;MQ=43;FQ=-120;PV4=0.2,4.8e-07,1,1   GT:PL:GQ    1/1:255,93,0:99
1   569717  .   T   C   125 PASS    DP=17;VDB=6.710477e-02;RPB=1.448414e+00;AF1=0.7304;AC1=1;DP4=1,0,4,7;MQ=46;FQ=-25;PV4=0.42,0.14,0.34,1  GT:PL:GQ    0/1:154,0,2:5
DNA.vcf
[soh1@x020 RDD]$ head RNA.vcf
##fileformat=VCFv4.1
#CHR    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HeLa-S3
1   89862   .   T   G   42  PASS    DP=3;VDB=5.980293e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36  GT:PL:GQ    1/1:74,9,0:16
1   90789   .   G   C   38.5    PASS    DP=6;VDB=6.499381e-02;AF1=1;AC1=2;DP4=0,0,3,1;MQ=43;FQ=-39  GT:PL:GQ    1/1:71,12,0:21
1   91581   .   G   A   23.8    PASS    DP=2;VDB=7.040000e-02;AF1=1;AC1=2;DP4=0,0,0,2;MQ=50;FQ=-33  GT:PL:GQ    1/1:55,6,0:10
1   142492  .   G   A   80.3    PASS    DP=6;VDB=9.464066e-02;AF1=1;AC1=2;DP4=0,0,4,1;MQ=50;FQ=-42  GT:PL:GQ    1/1:113,15,0:27
1   320229  .   G   T   47  PASS    DP=3;VDB=3.576708e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36  GT:PL:GQ    1/1:79,9,0:16
1   322852  .   T   C   44  PASS    DP=6;VDB=5.498856e-02;RPB=1.291774e+00;AF1=0.5;AC1=1;DP4=0,2,1,2;MQ=50;FQ=18.1;PV4=1,0.33,1,0.02    GT:PL:GQ    0/1:74,0,45:48
1   564654  .   G   A   125 PASS    DP=21;VDB=2.406943e-02;AF1=1;AC1=2;DP4=0,0,17,0;MQ=49;FQ=-78    GT:PL:GQ    1/1:158,51,0:99
1   566849  .   G   A   211 PASS    DP=396;VDB=6.920000e-06;RPB=-1.075361e+00;AF1=1;AC1=2;DP4=0,1,4,206;MQ=45;FQ=-282;PV4=1,0.45,0.32,0.1   GT:PL:GQ    1/1:244,255,0:99
RNA.vcf

If you compare the two files, there must be sites where the chromosome number and position are the same but the ALT alleles differ. The following Python code finds exactly those sites:

r_dra = {} # DNA

#1    89862   .   T   G   42    PASS    DP=3;VDB=5.980293e-02;AF1=1;AC1=2;DP4=0,0,3,0;MQ=50;FQ=-36    GT:PL:GQ    1/1:74,9,0:16
#0    1   2   3   4   5   6   7   8   9

f=open('DNA.vcf', 'r') # Open DNA.vcf
for line in f.readlines(): # read all lines iteratively
  if line.startswith('#'): pass # If it's a header, just pass.
  else: # for other lines...
    f_sp = line.strip().split('\t')
    r_dra[f_sp[0]+"\t"+f_sp[1]] = f_sp[2:] # make "chr num and position" as key and the rest (a list) as val.
f.close()

RDD_site=[] # RDD site list is an empty list.

f1=open('RNA.vcf', 'r')
for f1_line in f1.readlines():
  if f1_line.startswith('#'): pass
  else:
    f1_sp = f1_line.strip().split('\t')
    if('\t'.join(f1_sp[0:2])) in r_dra : # if the key "chr num and position" exists in DNA vcf data,
      DNA = r_dra['\t'.join(f1_sp[0:2])] # ALT seq data (the rest...)
      
      if f1_sp[4] != DNA[2]: # if RNA alt seq and DNA alt seq are NOT the same!
        RDD_site.append('chr'+f1_sp[0]+'\t'+'\t'.join(f1_sp[1:])) 
        # join all the info as tab-delimited string and append to the RDD list.
        
f1.close()

re=open('RDD.vcf', 'w') # Open and write a new file (in 'w' mode)

re.write('##fileformat=VCFv4.1'+'\n') # The first line describes file format, and then newline
re.write('#CHR\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHeLa-RDD\n') # Header: column info.
re.write('\n'.join(RDD_site)) # Add all lines in RDD list joined by newline

re.close()
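For reference, the same comparison can be written more compactly. This sketch reimplements the logic above with small helpers (parse_vcf and find_rdd are my own names, not library functions):

```python
# Compact equivalent of the RDD script above: index each VCF by (CHROM, POS)
# and keep RNA records whose ALT differs from the DNA ALT at the same site.
def parse_vcf(path):
    """Map (CHROM, POS) -> remaining tab-separated fields [ID, REF, ALT, ...]."""
    sites = {}
    with open(path) as handle:
        for line in handle:
            if line.startswith('#'):
                continue  # skip meta lines and the column header
            fields = line.rstrip('\n').split('\t')
            sites[(fields[0], fields[1])] = fields[2:]
    return sites

def find_rdd(dna_path, rna_path):
    dna = parse_vcf(dna_path)
    rna = parse_vcf(rna_path)
    # index 2 of the stored fields is ALT
    return [(site, rec) for site, rec in rna.items()
            if site in dna and rec[2] != dna[site][2]]
```

Using `with open(...)` and dict lookups keeps the logic identical (same key, same ALT comparison) while making the file handling and the RDD condition easier to read.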

RDD annotation

Let's annotate the RDD.vcf file obtained from the above process (the lines of RNA.vcf whose ALT allele differs from DNA.vcf's ALT even though the chromosome number and position are identical). Go to RCARE and upload the RDD.vcf we made by comparing the RNA and DNA VCF files. If you don't have the VCF, download the preprocessing utilities shown at the top of the Annotation tab.


  1. This means: template having multiple segments in sequencing, SEQ of the next segment in the template being reverse complemented, the first segment in the template, and secondary alignment. A secondary alignment means the read aligns reasonably well to multiple places. This topic on Biostars discusses secondary alignments in BAM. It is different from a duplicate (0x400 in FLAG): duplicates are created by abnormal amplification of specific fragments during the PCR step of library preparation due to technical issues. This is a nice article about how duplicates arise.↩︎

  2. Here’s a good tutorial on cowplot↩︎

  3. A-to-I RNA editing and human disease↩︎